Credit Card Fraud: A Tidymodels Tutorial

2023-04-13 04:50

Why do we care about the number of categories and whether they are “excessive”?

Consider the extreme case where a dataset has categories that each contain only one record. With category as a predictor, there is simply not enough data to make reliable predictions for new data carrying that category label. Additionally, if your modeling uses dummy variables, an extremely large number of categories produces a huge number of predictors, which can slow down the fitting. This is fine if all the predictors are useful, but if they aren’t (as in the case of having only one record for a category), trimming them will improve both the speed and the quality of the fit.
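To make the dummy-variable point concrete, here is a minimal sketch in base R. The data frame `toy` and its columns are made up for illustration and are not part of the fraud dataset; the point is only that a factor with k levels expands into k - 1 indicator columns under the default treatment coding.

```r
# Hypothetical toy data: a factor with 200 levels, one record per level.
set.seed(1)
toy <- data.frame(
  amt = runif(200),
  job = factor(paste0("job_", seq_len(200)))
)

# model.matrix() expands the factor into treatment-coded dummy columns:
# one intercept, one column for amt, and (number of levels - 1) for job.
design <- model.matrix(~ amt + job, data = toy)

# Drop the intercept and amt columns to count only the job dummies.
n_dummy_cols <- ncol(design) - 2
n_dummy_cols  # 199
```

With one record per level, each of those 199 columns is nonzero in exactly one row, so none of them can generalize to new data.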

If I had subject matter expertise, I could manually combine categories. For example, in this dataset, the three largest categories in job are surveying-related and perhaps could be combined. If you don’t have subject matter expertise, or if performing this task would be too labor-intensive, then you can use cutoffs based on the amount of data in a category. If the majority of the data falls into only a few categories, then it might be reasonable to keep those categories and lump everything else into an “other” category, or perhaps even drop the data points in the smaller categories. As a side note, the forcats package has a variety of tools for consolidating and dropping levels based on different cutoffs, if this is the approach you decide to take.
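As a rough sketch of the cutoff approach, the snippet below lumps small levels with `fct_lump_n()` and `fct_lump_min()` from forcats (both available in forcats 0.5.0 and later). The toy factor and the cutoffs of 2 levels and 10 records are arbitrary choices for illustration, not values taken from this analysis.

```r
library(forcats)

# Hypothetical toy factor: two large categories and three small ones.
job <- factor(c(rep("surveyor", 50), rep("engineer", 30),
                "poet", "poet", "juggler", "diver"))

# Keep the 2 most frequent levels; lump the rest into "other".
job_top2 <- fct_lump_n(job, n = 2, other_level = "other")

# Alternatively, lump every level that has fewer than 10 records.
job_min10 <- fct_lump_min(job, min = 10, other_level = "other")

# Counts per level after lumping.
table(job_top2)
```

On this toy factor both cutoffs give the same result: the two large levels survive and the four remaining records end up in "other". On real data the two rules can disagree, so it is worth checking the counts after lumping.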

One way to evaluate the compactness of a factor is to group the data by category and look at a table of counts. I like the gt package for making attractive tables in R. (Uncomment the line in Code Block 7 #gt:::as.tags.gt_tbl(table_3a) to see the table.) The tabular data also shows that there aren’t typos leading to duplicate categories.

Another way to evaluate the compactness is to make a cumulative plot. This looks at the proportion of data that is described as you add categories. I’m using the cowplot package to make multipanel figures. I want to look at both factors at once; this is fine for exploratory data analysis, but I wouldn’t recommend it for a report or presentation, since there is no connection between the two variables.
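The quantity behind a cumulative plot can also be computed directly. This is a minimal sketch with dplyr on made-up data (the tibble `toy` and its values are hypothetical), and it is not necessarily how the figure in this post was built: sort the categories by count, then take the running share of rows.

```r
library(dplyr)

# Hypothetical toy data standing in for the transaction table.
toy <- tibble(category = c(rep("grocery", 6), rep("gas", 3), "travel"))

cum_tbl <- toy %>%
  count(category, sort = TRUE) %>%   # largest category first
  mutate(
    rank = row_number(),             # number of categories included so far
    cum_prop = cumsum(n) / sum(n)    # share of rows covered so far
  )

cum_tbl
```

Here the first category covers 60% of the rows and the first two cover 90%. A curve that climbs quickly means a few categories describe most of the data; a slow, nearly linear climb means the data are spread across many categories.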

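# Code Block 7: Exploring the Compactness of the Categories

# Exploring the jobs factor
# bin and count the data and return sorted
# (the data frame name was lost in extraction; `fraud` is an assumption)
table_3a_data <- fraud %>% count(job, sort = TRUE)

# creating a table to go with this, but not displaying it
table_3a <- table_3a_data %>%
  gt() %>%
  tab_header(title = "Jobs of Card Holders") %>%
  cols_label(job = "Jobs", n = "Count") %>%
  opt_stylize(style = 1, color = "green", add_row_striping = TRUE)
#gt:::as.tags.gt_tbl(table_3a) #displays the table

# fig_1a: the plotting code for the job factor was lost in extraction

# Exploring the category factor
# (the start of this pipeline was lost; the two assignments below are
# reconstructed by analogy with table_3a and are assumptions)
table_3b_data <- fraud %>% count(category, sort = TRUE)

table_3b <- table_3b_data %>%
  gt() %>%
  tab_header(title = "Transaction Category in Credit Card Fraud") %>%
  cols_label(category = "Category", n = "Count") %>%
  opt_stylize(style = 1, color = "blue", add_row_striping = TRUE) #%>%
#gt:::as.tags.gt_tbl(table_3b)

# fig_1b: the plotting code for the category factor was lost in extraction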
